Introduction

Website to visit our Diabetes Prediction data set

Within this analysis, we’ll investigate factors correlated to diabetes. With a data set of 100,000 people, this investigation allows us to display relations between ages, HbA1c levels, smoking history, and glucose levels. With a wide range of data points, we begin to question if there are trends within this data that match our general understanding of diabetes. Our goal is to asses which of the 9 variables play a stronger role to the development of diabetes and if we can prove trends to better support our assumptions of this data. Through data visualization, chart analysis, and numerical analysis we will be able to present this data to convince a general audience of the important factors that contribute to diabetic trends.

Libraries used

library(tidyverse) # Loaded for dplyr
library(ggplot2) # Loaded for plotting
library(plotly) # Loaded for interactive plots
library(readr) # Loaded to read in data
library(knitr) # Loaded to compute and display data
library(scales) # Loaded to scale data 
library(dplyr) # Loaded for data manipulation

Loading the data set

diabetes_dataset <- read_csv("diabetes_prediction_dataset.csv", show_col_types = FALSE)
glimpse(diabetes_dataset)
## Rows: 100,000
## Columns: 9
## $ gender              <chr> "Female", "Female", "Male", "Female", "Male", "Fem…
## $ age                 <dbl> 80, 54, 28, 36, 76, 20, 44, 79, 42, 32, 53, 54, 78…
## $ hypertension        <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease       <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history     <chr> "never", "No Info", "never", "current", "current",…
## $ bmi                 <dbl> 25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 19.31, 2…
## $ HbA1c_level         <dbl> 6.6, 6.6, 5.7, 5.0, 4.8, 6.6, 6.5, 5.7, 4.8, 5.0, …
## $ blood_glucose_level <dbl> 140, 80, 158, 155, 155, 85, 200, 85, 145, 100, 85,…
## $ diabetes            <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Data variables

Gender: The biological sex of the individual.

Age: Refers to the age of the individual.

Hypertension: Indicates whether individuals have this condition.

Heart disease: Indicates whether individuals have this condition.

Smoking history: Identifies individual’s smoking history.

BMI: A measure of body fat based on weight and height.

HbA1c level: Refers to the measure of a person’s average blood sugar level over the past 2-3 months.

Blood glucose level: Refers to the amount of glucose in the bloodstream at a given time.

Diabetes: Variable being predicted, with values of 1 indicating the presence of diabetes and 0 indicating no presence.

Male vs. Female Blood Sugar Levels (HbA1c) Plot

Can we predict diabetes status based on blood sugar levels?

For our first plot we filtered our data to categorize males and females as diabetic, pre-diabetic, and normal based on blood sugar levels(HbA1c). To do this we selected the variables of interest which are gender, diabetes, and HbA1c_level. From the first table shown we can see that blood sugar levels not necessarily define the diabetes status of the individuals within our data set.

# select, filter, and mutate the columns I want to work with 

mut_diabdat_for_BSL <- diabetes_dataset %>%
  select(gender, diabetes, HbA1c_level) %>%
  filter(gender != "Other") %>%       # we want to focus on male and female only
  mutate(HbA1c_category = case_when(  # categorize diabetes status based on BSL
    HbA1c_level < 5.7 ~ "Normal < 5.7%",
    HbA1c_level >= 5.7 & HbA1c_level <= 6.4 ~ "Prediabetic 5.7% - 6.4%",
    HbA1c_level >= 6.5 ~ "Diabetic ≥ 6.5%",
    TRUE ~ NA_character_
  ))

kable(head(mut_diabdat_for_BSL, 5), caption = "99,982 x 4 (first 5 rows)")
99,982 x 4 (first 5 rows)
gender diabetes HbA1c_level HbA1c_category
Female 0 6.6 Diabetic ≥ 6.5%
Female 0 6.6 Diabetic ≥ 6.5%
Male 0 5.7 Prediabetic 5.7% - 6.4%
Female 0 5.0 Normal < 5.7%
Male 0 4.8 Normal < 5.7%
# count, group_by, and mutate to obtain the number and percentage for each diabetic category for M and F

get_perc_count <- mut_diabdat_for_BSL %>% 
  count(gender, HbA1c_category) %>%  # new count col of the number of people within each category for M and F
  group_by(gender) %>%   # group by gender
  mutate(percent = n / sum(n) * 100) # get percentage of population within each diabetic category

# create my barplot, add x and y labels as well as color and text details. 
get_perc_count_plotly <- ggplot(get_perc_count, aes(x = gender, y = n, fill = HbA1c_category)) +
  geom_bar(stat = "identity", position = "fill", color = "black", linewidth = 0.1, alpha = 0.5) + 
  geom_text(aes(label = paste0(n, " (", round(percent, 1), "%)")),
            position = position_fill(vjust = 0.5), size = 3) +
  scale_fill_manual(values = c("Normal < 5.7%" = "cornsilk2", 
                               "Prediabetic 5.7% - 6.4%" = "darkkhaki", 
                               "Diabetic ≥ 6.5%" = "darkgoldenrod4")) +
  labs(title = "Male vs. Female Blood Sugar Levels (HbA1c)", 
       x = "Gender", y = "Proportion", fill = "HbA1c Category") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(
    plot.title = element_text(size = 13),
    axis.text = element_text(size = 11),
  )

# bar plot interactive using ggplotly

ggplotly(get_perc_count_plotly)
kable(head(get_perc_count, 5), caption = "6 x 4 (first 5 rows)") # showing the first 5 rows of the data set
6 x 4 (first 5 rows)
gender HbA1c_category n percent
Female Diabetic ≥ 6.5% 11835 20.21280
Female Normal < 5.7% 22492 38.41372
Female Prediabetic 5.7% - 6.4% 24225 41.37348
Male Diabetic ≥ 6.5% 8959 21.62443
Male Normal < 5.7% 15358 37.06976

Age Distribution in Diabetes, Heart Disease, and Hypertension Plot

How is age related to diabetes, heart disease, and hypertension?

To create this density plot we created 2 different groups of individuals, the first group are individuals with all 3 conditions which are diabetes, heart disease, and hypertension while the second group is free from all 3 conditions. After separating the 2 groups we brought it together to create a density plot and show the distribution within age. From this graph we concluded that there is a relationship between age and these 3 conditions. From the graph we observe that older individuals tend to have all 3 conditions.

# show a density plot to display age distribution based on diabetes and heart disease
# I create 2 new variables based on my original data set, I select and filter the columns I want to work with

DHH_only <- diabetes_dataset %>% # individuals with all 3 conditions
  select(age, diabetes, heart_disease, hypertension) %>%  
  filter(age >= 2, diabetes == 1, heart_disease == 1, hypertension == 1) %>%
  mutate(group = "Diabetes, H.D, and Hyp.")

kable(head(DHH_only, 5), caption = "358 x 5 (first 5 rows)") #showing the table
358 x 5 (first 5 rows)
age diabetes heart_disease hypertension group
57 1 1 1 Diabetes, H.D, and Hyp.
62 1 1 1 Diabetes, H.D, and Hyp.
62 1 1 1 Diabetes, H.D, and Hyp.
67 1 1 1 Diabetes, H.D, and Hyp.
72 1 1 1 Diabetes, H.D, and Hyp.
No_DHH_only <- diabetes_dataset %>% # individuals free from all 3 conditions
  select(age,heart_disease, diabetes, hypertension) %>%
  filter(age >= 2, diabetes == 0, heart_disease == 0, hypertension ==0) %>%
  mutate(group = "Free of Diabetes, H.D, and Hyp.")

kable(head(No_DHH_only, 5), caption = "81,885 x 5 (first 5 rows)") #showing the table
81,885 x 5 (first 5 rows)
age heart_disease diabetes hypertension group
54 0 0 0 Free of Diabetes, H.D, and Hyp.
28 0 0 0 Free of Diabetes, H.D, and Hyp.
36 0 0 0 Free of Diabetes, H.D, and Hyp.
20 0 0 0 Free of Diabetes, H.D, and Hyp.
79 0 0 0 Free of Diabetes, H.D, and Hyp.
# create the density plot, I name my x and y labels, title, color, fill, and plot size details
interactive_DHH <- ggplot() +
  geom_density(data = DHH_only, aes(x = age, fill = group), alpha = 0.5) +  # Diabetes cases
  geom_density(data = No_DHH_only , aes(x = age, fill = group), alpha = 0.5) +
  labs(title = "Age Distribution: Diabetes vs. Heart Disease",
       x = "Age",
       y = "Density") +
  scale_x_continuous(breaks = seq(2, 80, by = 6)) +
  scale_fill_manual(values = c("Diabetes, H.D, and Hyp." = "#660033",
                               "Free of Diabetes, H.D, and Hyp." = "#66CCFF")) +  # Custom colors
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(
    plot.title = element_text(size = 13),
    axis.text = element_text(size = 11),
  )

# here I make the bar plot interactive using ggplotly

ggplotly(interactive_DHH)

BMI Distribution by Hypertension Status Plot

Is there a correlation between body mass index and hypertension status?

To create this plot, we focused on the BMI and Hypertension variables from our data set. 0 representing no hypertension and 1 representing hypertension. From this graph we concluded that there is a correlation between BMI and hypertension status. We observe that individuals with hypertension tend to have higher BMI values compared to those without hypertension.

kable(head(diabetes_dataset, 5), caption = "10,000 x 9 (first 5 rows)") # display the diabetes_dataset table
10,000 x 9 (first 5 rows)
gender age hypertension heart_disease smoking_history bmi HbA1c_level blood_glucose_level diabetes
Female 80 0 1 never 25.19 6.6 140 0
Female 54 0 0 No Info 27.32 6.6 80 0
Male 28 0 0 never 27.32 5.7 158 0
Female 36 0 0 current 23.45 5.0 155 0
Male 76 1 1 current 20.14 4.8 155 0
overall_median_bmi <- median(diabetes_dataset$bmi, na.rm = TRUE) # new variable for the median of bmi to obtain horizontal red line

#create a violin plot, hypertension is selected as our x variable and bmi as our y variable
BMI_distribution <- ggplot(diabetes_dataset, aes(x = factor(hypertension), y = bmi, fill = factor(hypertension))) + # use factor so that hypertension can be treated as a factor
  geom_violin(alpha = 0.6) +
  geom_hline(yintercept = overall_median_bmi, linetype = "solid", color = "red", linewidth = 0.3) + #Line marks a 27 BMI which is classified as an overweight value
  scale_fill_manual(values = c("azure3", "#CC6600")) +
   scale_y_continuous(breaks = seq(10.01, 95.69, by = 5)) +
  labs(title = "BMI Distribution by Hypertension Status",
       x = "Hypertension Status (0 = No, 1 = Yes)",
       y = "BMI",
       fill = "Hypertension") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(
    plot.title = element_text(size = 13),
    axis.text = element_text(size = 11),
  )

# make our plot interactive with ggplotly
ggplotly(BMI_distribution)

Blood Glucose Levels by Diabetes Status (age 3-80) Plot

What’s the correlation between blood glucose levels and diabetes?

In this box plot we are comparing blood glucose levels by diabetes status, focusing on individuals within ages 3-80 because the youngest individual with diabetes is 3 years old and the oldest is 80. . From this plot we conclude that there is a strong correlation between both variables. From this graph we observe that individuals with higher blood glucose levels tend to fall within the diabetes status.

# these are the variables I worked with to create this table
table_diabetes_no_diabetes <- diabetes_dataset %>%
  select(age,diabetes, blood_glucose_level) %>%
  filter(age >= 3) %>% 
  mutate(diabetes = factor(diabetes, levels = c(0, 1), labels = c("No Diabetes", "Diabetes")))

# here I show the table
kable(head(table_diabetes_no_diabetes, 5), caption = "96,713 x 3 (first 5 rows)")
96,713 x 3 (first 5 rows)
age diabetes blood_glucose_level
80 No Diabetes 140
54 No Diabetes 80
28 No Diabetes 158
36 No Diabetes 155
76 No Diabetes 155
diabetes_and_no_diabetes_plot <- diabetes_dataset %>%
  select(blood_glucose_level,diabetes, age) %>%
  filter(age >= 3) %>% 
  mutate(diabetes = factor(diabetes, levels = c(0, 1), labels = c("No Diabetes", "Diabetes"))) %>% 
  ggplot(aes(x = diabetes, y = blood_glucose_level, fill = diabetes)) +
  geom_boxplot(alpha = 0.5) +
  labs(title = "Blood Glucose Levels by Diabetes Status (age 3-80)",
       x = "Diabetes Status",
       y = "Blood Glucose Level") +
  scale_fill_manual(values = c("Diabetes" = "#CC3300", "No Diabetes" = "#99CCFF")) +
  labs(fill = "Status") +
  scale_y_continuous(breaks = seq(80, 300, by = 20)) +
  geom_hline(yintercept = c(120, 140), linetype = "solid", color = "#0033FF", size = 0.2) +
  geom_hline(yintercept = c(200), linetype = "solid", color = "#CC3300", size = 0.3) +  # Mark High Range
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(
    plot.title = element_text(size = 13),
    axis.text = element_text(size = 11),
  )

# make our plot interactive with ggplotly
ggplotly(diabetes_and_no_diabetes_plot)

BMI vs. Age Across Diabetes & Heart Disease Plot

Is BMI associated with diabetes, heart disease, and age across out data?

In this scatter plot we are comparing the age and BMI of 2 different groups of individuals; the first group is the group of individuals with diabetes only and the second group are individuals with heart disease. To achieve these results, we created 2 different tables for each group, then we combined the data to create our plot. In the graph we marked a red horizontal line representing the most common BMI for both groups. We concluded that there is a strong correlation between both conditions, BMI, and age. Higher BMI and older age are related to both conditions. We detected a high concentration of individuals who are 80 years old. This could be because diabetes and heart disease tend to be more common among older adults, besides that we discovered that our data set includes a disproportionately high number of individuals within the age of 80.

#diabetes_dataset

# here I create 2 variables that represent the data frames I'll be working on 
diabetes_only <- diabetes_dataset %>%
  select(age, bmi, diabetes) %>%
  filter(age >= 2, diabetes == 1) %>%
  mutate(condition = "Diabetes Only")

# here is a display the diabetes_only data frame
kable(head(diabetes_only, 5), caption = "8,500 x 4 (first 5 rows)")
8,500 x 4 (first 5 rows)
age bmi diabetes condition
44 19.31 1 Diabetes Only
67 27.32 1 Diabetes Only
50 27.32 1 Diabetes Only
73 25.91 1 Diabetes Only
53 27.32 1 Diabetes Only
heart_disease_only <- diabetes_dataset %>%
  select(age, bmi, heart_disease) %>%
  filter(age >= 2, heart_disease == 1) %>%
  mutate(condition = "Heart Disease Only")

# here is a display the heart_disease_only data frame
kable(head(heart_disease_only, 5), caption = "3,942 x 4 (first 5 rows)")
3,942 x 4 (first 5 rows)
age bmi heart_disease condition
80 25.19 1 Heart Disease Only
76 20.14 1 Heart Disease Only
72 27.94 1 Heart Disease Only
67 27.32 1 Heart Disease Only
77 32.02 1 Heart Disease Only
# next, I combine the 2 data frames to create a scatter plot
combined_dataaa <- bind_rows(diabetes_only, heart_disease_only)

# here I want to show that there is a common BMI between individuals with diabetes only and individuals with heart disease only, I will show this by drawing a red horizontal line across the plot
common_bmi <- combined_dataaa %>%
  count(bmi) %>% # count bmi
  arrange(desc(n))  # sort by highest count

most_common_bmi <- common_bmi$bmi[1] # extract the most frequent bmi to mark a red horizontal line

ggplot(combined_dataaa) +
  geom_point(aes(x = age, y = bmi, color = condition), alpha = 0.3) +  # Scatter plot with BMI
  geom_jitter(aes(x = age, y = bmi, color = condition), width = 0.1, height = 0.1, alpha = 0.3) +
 geom_hline(aes(yintercept = 27.32, color = "Common BMI = 27.32"), linetype = "solid", size = 0.3)  +  # horizontal line marking common BMI
  scale_color_manual(
    values = c("Diabetes Only" = "cornflowerblue", 
               "Heart Disease Only" = "darkorchid4", 
               "Common BMI = 27.32" = "red"),
    labels = c("Common BMI = 27.32", "Diabetes Only", "Heart Disease Only")) +
  scale_x_continuous(breaks = seq(0, 80, by = 5)) + 
  scale_y_continuous(breaks = seq(10.01, 95.69, by = 10)) +
  labs(title = "BMI vs. Age Across Diabetes & Heart Disease",
       x = "Age",
       y = "BMI",
       color = "Condition") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(
    axis.text = element_text(size = 11),
    legend.text = element_text(size = 11),
    plot.title = element_text(size = 13)
  )


A relation to BMI and Heart Disease

Each person within this scale has heart disease. Here a comparison is made between declared underweight and overweight people, grouped by sex, based on a BMI scale. There’s a significant increase in population percentage for those who are considered overweight and that have heart disease. With visual aid, it can be concluded that as weight increases, chances of heart disease will increase.

library(ggplot2)
library(dplyr)

# categories
bmi_heart_gender <- diabetes_dataset %>%
  filter(heart_disease == 1) %>%
  mutate(
    bmi_category = case_when(
      bmi <= 19 ~ "Underweight",
      bmi >= 30 ~ "Overweight",
      TRUE      ~ NA_character_
    )
  ) %>%
  filter(!is.na(bmi_category)) %>%
  group_by(gender, bmi_category) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(gender) %>%
  mutate(percentage = round(100 * count / sum(count), 1)) %>%
  ungroup()

bmi_heart_gender$bmi_category <- factor(bmi_heart_gender$bmi_category,
                                        levels = c("Underweight", "Overweight"))
max_ct <- max(bmi_heart_gender$count)
y_breaks <- pretty(c(0, max_ct), n = 10)   
#plot
ggplot(bmi_heart_gender,
       aes(x = bmi_category,
           y = count,
           group = gender,
           color = gender)) +
  geom_line(position = position_dodge(width = 0.5), linewidth = 1) +
  geom_point(position = position_dodge(width = 0.5), size = 5) +
  geom_text(aes(label = paste0(count, " (", percentage, "%)")),
            vjust = -1,
            position = position_dodge(width = 0.5)) +
  scale_y_continuous(
    breaks = y_breaks,
    expand = expansion(mult = c(0, .05))   # so labels don’t get clipped
  ) +
  labs(
    title = "Heart Disease Cases by Gender and BMI Category",
    x     = "BMI Category",
    y     = "Count of People with Heart Disease",
    color = "Gender"
  ) +
  theme_minimal()


An excpetion?

The data here is heavily dependent on BMI scale. It is important to note that BMI is not really a great determination for those who have diabetes, but there is a general trend within the data that people who have a BMI over 30 are more likely to be diabetic. In this graph we see that the percentages are that of people who are considered overweight and do not have diabetes. So, for the 18% represented of overweight men without diabetes there is 82% of overweight men with diabetes. The same logic applies to the data related to females.

library(dplyr)
library(ggplot2)

 male_data = diabetes_dataset %>% filter(gender == "Male")
 female_data = diabetes_dataset %>% filter(gender == "Female")
# Summarise men
men_summary <- male_data %>%
  summarise(
    count = sum(bmi >= 30 & diabetes == 0),
    total = n()
  ) %>%
  mutate(
    sex        = "Men",
    percent    = count / total * 100
  )
# Summarise women
women_summary <- female_data %>%
  summarise(
    count = sum(bmi >= 30 & diabetes == 0),
    total = n()
  ) %>%
  mutate(
    sex        = "Women",
    percent    = count / total * 100
  )
# Combine into one data frame
men_women_df = bind_rows(men_summary, women_summary)
#plot
ggplot(men_women_df, aes(x = sex, y = count, fill = sex)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = sprintf("%d (%.1f%%)", count, percent)),
            vjust = -0.5, size = 4) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Count of Overweight (BMI ≥ 30) without diabetes, by Sex",
    x     = NULL,
    y     = "Number of Individuals",
    fill  = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")


A relation to hypertension?

This depicts the different categories of HbA1c levels and their relation to a patients hypertension status

library(dplyr)
library(ggplot2)
library(scales)
#creating variables and assigning to dataset
df_summary <- diabetes_dataset %>%
  mutate(
    HbA1c_cat = case_when(
      HbA1c_level < 5.7                      ~ "< 5.7 (non diabetic)",
      HbA1c_level >= 5.7 & HbA1c_level < 6.5 ~ "5.7–6.4 (prediabetic)",
      HbA1c_level >= 6.5                    ~ "≥ 6.5 (diabetic)"
    ),
    hypertension = factor(hypertension, levels = c(0,1), #creating hypertension to a factor
                          labels = c("No", "Yes"))
  ) %>%
  group_by(HbA1c_cat, hypertension) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(percent = n / sum(n))  # groups by HbA1c_cat
# plot
ggplot(df_summary, aes(x = HbA1c_cat,
                       y = percent,
                       fill = factor(hypertension))) +
  geom_col(position = "stack") +
  geom_text(aes(label = scales::percent(percent, 1)),
            position = position_stack(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_discrete(name = "Hypertension",
                      labels = c("No", "Yes")) +
  labs(
    title = "Hypertension Status by HbA1c Category",
    x     = "HbA1c Category",
    y     = "Percent within Category"
  ) +
  theme_light()


Diabetic range for men

This graph shows the population density of men based on diabetes status, based on age range. Within our data, it is seen that there is a peak seen where a large population of men are diagnosed with diabetes around the age of 65. It is also seen that diabetes occurs for a significant population starting around the age of 20.

library(ggplot2)
# assigning dataset
df_male <- diabetes_dataset %>% 
  filter(gender == "Male") %>% 
  mutate(diabetes = factor(diabetes, #making diabetes status as a factor
                           levels = c(0,1),
                           labels = c("Non‑diabetic","Diabetic")))
# ggplot code
male_diabetic_dens = ggplot(df_male, aes(x = age, fill = diabetes)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(
    name   = "Diabetes status",
    values = c("Non‑diabetic" = "lightpink", 
               "Diabetic"     = "#301934")
  ) +
  labs(
    title = "Age Distribution in Male Subjects",
    x     = "Age",
    y     = "Density"
  ) +
  theme_minimal()

ggplotly(male_diabetic_dens)

Diabetic range for women

This graph shows the population density of women based on diabetes status, based on age range. Within our data, it is seen that there is a peak seen where a large population of women are diagnosed with diabetes around the age of 65. It is also seen that diabetes occurs for a significant population starting around the age of 30.

library(ggplot2)
# assigning dataset
df_female <- diabetes_dataset %>% 
  filter(gender == "Female") %>% 
  mutate(diabetes = factor(diabetes, #making diabetes status as a factor
                           levels = c(0,1),
                           labels = c("Non‑diabetic","Diabetic")))
# ggplot code
 female_diabetic_dens = ggplot(df_female, aes(x = age, fill = diabetes)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(
    name   = "Diabetes status",
    values = c("Non‑diabetic" = "lightpink", 
               "Diabetic"     = "#309")
  ) +
  labs(
    title = "Age Distribution in Female Subjects",
    x     = "Age",
    y     = "Density"
  ) +
  theme_minimal()

 ggplotly(female_diabetic_dens)

Smoking

In the smoking data there are 6 unique values

  1. Never: Has Never smoked
  2. Not current: Has smoked but is not currently smoking
  3. Former: Has quit smoking (abstained for longer than)
  4. Current: Is currently a smoker
  5. Ever: Has ever smoked regardless of current smoking status
  6. No Info: No smoking history information available

The total amount of people who fall into each category is as follows;

  1. Never: 35095

  2. Not current: 6447

  3. Former: 9352

  4. Current: 9286

  5. Ever: 4004

  6. No Info: 35816

There is quite a sizable amount of people in the No info category.

The total number of people in the dataset is 100000. To help clean up the data, we can filter ‘No Info’ people out. When we do that we get 64184.

# Figure out the unique categories of smoking history
#unique(diabetes_dataset$smoking_history)

# Count amount of people who belong to each unique smoking category
#Omit No info

smoking_tally <- diabetes_dataset %>% filter(smoking_history != 'No Info') %>%  group_by(smoking_history) %>% summarise(total_people = n())

Sumarizing diabetes and smoking history

The data was then summarized to gather the total counts belonging to each smoking category and further grouped by diabetes status.

A percentage per smoking category with diabetes is then calculated dividing the count with diabetes by the total count in each smoking category.

#group diabetic vs non diabetic people together

smoking_diabetes_dataset <- diabetes_dataset %>%
  filter(smoking_history != 'No Info') %>%  
  group_by(smoking_history, diabetes) %>%  
  summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
# Inner join tally data with diabetic grouped data;
#mutate a column to create a percentage per category;
#select desired columns

smoking_diabetes_percentage <- inner_join(smoking_tally, smoking_diabetes_dataset, by = 'smoking_history') %>% mutate(Percentage = total/total_people) %>% select(smoking_history, diabetes, total, Percentage)

Graphing the Data

Now we can graph the relationship between smoking and diabetes as separated by smoking category.

#Create initial graph about smoking and diabetes

smoking_diabetes_graph <- ggplot(smoking_diabetes_percentage,
       aes(x    = smoking_history,
           y    = total,
           fill = factor(diabetes))) +
  geom_col(position = position_dodge(width = 0.9)) +
 
  # add percent labels
  geom_text(aes(label = percent(Percentage, accuracy = .1)),
            position = position_dodge(width = 1),
            vjust    = .4,        # nudge above bar
            size     = 3,
            hjust    = -.1) +
            
 
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  labs(
    title   = "Smoking History and Diabetes Relationship",
    x       = "Smoking History",
    y       = "Count of People",
    fill    = "Has Diabetes",
    caption = "1 = diabetes, 0 = no diabetes"
  ) +
  
    scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#ffbc42"))


smoking_diabetes_graph

The graph reveals that 17% of former smokers have diabetes, a significantly higher percentage than all other categories. Never smokers have the lowest percentage, with 9.5% having diabetes, and the other 3 categories range between 10-12%. This shows that former smokers are the most likely to have diabetes compared to all other smoking history categories.

  1. Never: 9.5%

  2. Not current: 10.7%

  3. Former: 17%

  4. Current: 10.2%

  5. Ever: 11.8%

Heart disease and Smoking

This data also included information about heart disease status. The same filters and groupings applied to the diabetes and smoking data was applied with heart disease as the focus.

smoking_heart_disease_data <- diabetes_dataset %>%
  filter(smoking_history != 'No Info') %>%  
  group_by(smoking_history, heart_disease) %>%  
  summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
smoking_heart_disease_percentage <- inner_join(smoking_tally, smoking_heart_disease_data, by = 'smoking_history') %>% mutate(Percentage = round( total/total_people, digits = 4)) %>% select(smoking_history, heart_disease, total, Percentage)

Graphing relationship of Smoking History and Heart Disease

smoking_heart_disease_graph <- ggplot(smoking_heart_disease_percentage,
       aes(x    = smoking_history,
           y    = total,
           fill = factor(heart_disease))) +
  geom_col(position = position_dodge(width = 0.9)) +
 
  # add percent labels
  geom_text(aes(label = percent(Percentage, accuracy = .1)),
            position = position_dodge(width = 1),
            vjust    = .4,        # nudge above bar
            size     = 3,
            hjust    = -.1) +
            
 
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
  labs(
    title   = "Smoking History and Heart Disease Relationship",
    x       = "Smoking History",
    y       = "Count of People",
    fill    = "Has Heart Disease",
    caption = "1 = Heart Disease, 0 = No Heart Disease"
  ) +
  
    scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#577590"))


smoking_heart_disease_graph

The graph reveals a similar relationship, but at much lower percentages. Never smokers once again are at the lowest risk for heart disease at 3.1%, while former smokers sit at 9.7%. Not Current and Current are both at a similar percentage for heart disease, while we see a spike in ever smokers.

  1. Never: 3.1%

  2. Not current: 4.5%

  3. Former: 9.7%

  4. Current: 4.4%

  5. Ever: 7.8%

The lower percentages indicate less correlation between smoking and heart disease, however the trend of former smokers having higher risk for disease still exists.

Smoking and Hypertension

The data also contained information on hypertension status. Once again, the same filtering and grouping methods were applied to the dataset using the hypertension column.

smoking_hypertension_data <- diabetes_dataset %>%
  filter(smoking_history != 'No Info') %>%  
  group_by(smoking_history, hypertension) %>%  
  summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
smoking_hypertension_percentage <- inner_join(smoking_tally, smoking_hypertension_data, by = 'smoking_history') %>% mutate(Percentage = round( total/total_people, digits = 4)) %>% select(smoking_history, hypertension, total, Percentage)
smoking_hypertension_graph <- ggplot(smoking_hypertension_percentage,
       aes(x    = smoking_history,
           y    = total,
           fill = factor(hypertension))) +
  geom_col(position = position_dodge(width = 0.9)) +
 
  # add percent labels
  geom_text(aes(label = percent(Percentage, accuracy = .1)),
            position = position_dodge(width = 1),
            vjust    = .4,        # nudge above bar
            size     = 3,
            hjust    = -.1) +
            
 
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, .1))) +
   labs(
    title   = "Smoking History and Hypertension Relationship",
    x       = "Smoking History",
    y       = "Count of People",
    fill    = "Has Hypertension",
    caption = "1 = Hypertension, 0 = No Hypertension"
  ) +
  
  scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#f95d6a"))

smoking_hypertension_graph

The hypertension category reveals something a little different than the trend, as not current smokers have the lowest percentage of hypertension. Current and Never smokers are at essentially the same percentage, with current smokers having a slightly slower risk. Similar to the heart disease data, former smokers have the highest percentage of hypertension, and ever smokers following.

  1. Never: 9.1%

  2. Not current: 7.6%

  3. Former: 14.3%

  4. Current: 9.0%

  5. Ever: 10.5%

Age as a Factor

When looking at the data, there is a spike in former smokers risk for the three health complications discussed.

Questions arose; If you never quit smoking, do you maintain similar risk to people who never have smoked? What changes from current to former smokers?

To have been classified a former smoker, you at one point had to have been a current smoker; which signifies a change in age from current to former smoking status.

Age and Smoking History Category

A density plot of the age range by smoking category is created to visualize the ages where each smoking category range from.

smoking_data <- diabetes_dataset %>% filter(smoking_history != "No Info")

smoking_age_density <- ggplot(smoking_data, aes(x = age, color = as.factor(smoking_history))) + geom_density() +
  
  scale_x_continuous(breaks = seq(0, 80, by = 5),  
                     limits = c(0, 80)) +
  labs(
    title = "Smoking History Across Ages",
    x = "Age",
    color = "Smoking History"
  )
       
smoking_age_density       

The density plot shows an increase in former smokers from the 40-60 age range, while current smokers begin to decrease at around age 47. Similarly we see a decrease in all other categories at the 40-60 age range.

The Age and Health Factors

We can compare this with the density of people with diabetes, hypertension, and heart disease across all ages to see if there are similar spikes.

#AGE VS DIABETES
age_vs_diabetes <- ggplot(diabetes_dataset, 
                          aes(x = age, fill = as.factor(diabetes))
                          ) +
  
  scale_x_continuous(breaks = seq(0, 80, by = 5),  
                     limits = c(0, 80)) +
  
  geom_density(alpha = .5) +  
  
  labs(
    title   = "Age and Diabetes",
    x       = "Age",
    fill    = "Diabetes",
    caption = "1 = Diabetes, 0 = No Diabetes"
  ) +
  
  scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#ffbc42"))


#AGE VS HYPERTENSION
age_vs_hypertension <- ggplot(diabetes_dataset,
                              aes(x = age, fill = as.factor(hypertension))
                              ) + 
  
  geom_density(alpha = .5) + 
  
  scale_x_continuous(breaks = seq(0, 80, by = 5),  
                     limits = c(0, 80)) +
  
  labs(
    title   = "Age and Hypertension",
    x       = "Age",
    fill    = "Hypertension",
    caption = "1 = Hypertension, 0 = No Hypertension"
  ) +
  
    scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#f95d6a"))

  
# AGE VS HEART DISEASE
age_vs_heart_disease <- ggplot(smoking_data, 
                               aes(x = age, fill = as.factor(heart_disease))) +
  
  geom_density(alpha = .5) +
  
  scale_x_continuous(breaks = seq(0, 80, by = 5),  
                     limits = c(0, 80)) +
  labs(
    title   = "Age and Heart Disease",
    x       = "Age",
    fill    = "Heart Disease",
    caption = "1 = Heart Disease, 0 = No Heart Disease"
  ) +
  
    scale_fill_manual(values = c("0" = "#1aaf5d", "1" = "#577590"))


age_vs_heart_disease


age_vs_hypertension

With diabetes, heart disease, and hypertension, we see a increase at a similar age range to the former smokers spike, from 40-60. This also highlights how age factors into diabetic health risk. The density plots of age and the three health issues reveal a relationship between age; the older you get, the more likely you are to have diabetes, heart disease, and hypertension, with a big spike once you are over 40 years old.

Findings

Former smokers are at a higher risk for diabetes, hypertension, and heart disease

As people get older, their risk for disease increases

Although current smokers risk is not reflected through our percentage graph, further digging shows that around the age where current smokers decrease and former smokers increase is around the same age range that the risk for disease increases


Conclusion

Within this analysis we’ve scaled:

From this we were able to to effectively show and defend our general assumptions about diabetes and its co factors.

Limitations within our data